(Experimental) Add support for NTK RoPE scaling #118
Conversation
Important update: previously, the alpha value wasn't being applied correctly. It is now, so setting alpha alone is enough for NTK RoPE scaling (there is no need to also set compress_pos_emb to the same value). Also added perplexity results from a test of a 30B model.
I might refactor this a bit later, but it seems okay. I'll merge it as is for now.
Hi, I'm confused by the current code. It seems like compress_pos_emb is still used alongside the alpha value, especially if we set compress_pos_emb to something other than 1 to scale the value of t. But dynamic scaling doesn't seem to do that. Is this intentional?
If compress_pos_emb is set to 1, the rotary embedding base is still 10000 (as if nothing changed). Ideally you want to set either compression or alpha, not both at the same time (for example, do not use compress 2 and alpha 2). Also, this implementation of NTK is static RoPE scaling; dynamic NTK scaling isn't implemented in exllama yet (it depends on the context length at generation time).
Oh, thank you for the clarification. So the base value is static, determined by alpha, correct? And then we can generate with a context longer than the default context size? If I'm reading the graph from the Reddit post correctly, with alpha 4 I could generate at a context size of 5000 without perplexity exploding, just like the yellow line in the graph?
@fahadh4ilyas Correct.
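To make the distinction above concrete, here is a minimal sketch of the two knobs (not the actual exllama code; the function name and defaults are illustrative): compress_pos_emb linearly compresses the position indices, while alpha applies static NTK scaling by raising the rotary base with the alpha ** (dim / (dim - 2)) rule from the linked Reddit post.

```python
import torch

def rope_tables(head_dim, base=10000.0, alpha=1.0, compress_pos_emb=1.0,
                seq_len=4096, device="cpu"):
    # Static NTK scaling: alpha raises the rotary base so long positions still
    # map onto frequencies close to what the model saw during training.
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))

    # Linear (SuperHOT-style) scaling: compress_pos_emb divides the positions t.
    t = torch.arange(seq_len, device=device).float() / compress_pos_emb

    freqs = torch.outer(t, inv_freq)  # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

# alpha=4, compress_pos_emb=1: positions are untouched, only the base changes.
cos, sin = rope_tables(head_dim=128, alpha=4.0)
```

With alpha=4 and compress_pos_emb=1, the positions stay as-is and only the base grows, which is why setting alpha alone is now sufficient.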
This adds support for the new NTK RoPE scaling, mentioned in #115.
"According to this post, this is a method of rope scaling that result in less perplexity loss and a bigger possible scaling:
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/"
Adds the parameter "a", "alpha", which is used when loading a model with "-a"
Tested on 65B models at 4K context, with 48GB VRAM (2x24) using gs 16,20
Perplexity:
For tulu-30B-GPTQ (non-SuperHOT):
For Tulu-30B-SuperHOT-8K-4bit-32g:
Note: for 8K context and above, I suggest sticking with SuperHOT.
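For completeness, here is a sketch of how cos/sin tables like the ones above are typically applied to the query and key heads. This is the standard rotate-half RoPE formulation, not exllama's CUDA kernel; the tensor shapes and toy sizes are assumptions for illustration only.

```python
import torch

def rotate_half(x):
    # Split the head dimension in half, then swap and negate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin, positions):
    # cos/sin: (max_seq, head_dim // 2) tables such as those built above.
    # q, k:    (batch, seq_len, n_heads, head_dim)
    cos = torch.cat((cos, cos), dim=-1)[positions][None, :, None, :]
    sin = torch.cat((sin, sin), dim=-1)[positions][None, :, None, :]
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# Toy example: 4 tokens, 8 heads, head_dim 128, alpha = 4.
head_dim, seq = 128, 4
base = 10000.0 * 4.0 ** (head_dim / (head_dim - 2))
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(seq).float(), inv_freq)
q = torch.randn(1, seq, 8, head_dim)
k = torch.randn(1, seq, 8, head_dim)
q_rot, k_rot = apply_rope(q, k, freqs.cos(), freqs.sin(), torch.arange(seq))
```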